3. Search systems
Today's topics
Information retrieval systems
Interfaces of information search
Conclusions
Long history of information retrieval (IR) research
No new research topic on text search
New algorithms for images and movies
Information retrieval interfaces are far from ideal
Incremental search
Word suggestion / completion
Human life is the series of information searches
Most of our everyday life are related to searches
We seldom create information from scratch
Example 1: Sending an email message
Find the email application
Find the "compose" button
Find the destination address
Enter the message
Find attachment files
Find the "send" button
Example 2: various hobbies
Find a nice restaurant
Find an interesting TV program
Find an interesting movie
Find interesting news articles
Find a music
Find interesting exhibitions
Example 3: GUI widgets
Menus
Find programs and functions
Icons
Find programs and data
Folders
Find data based on hierarchical data structure
Scrollbar
Find something out of sight
History of information retrieval
Text search
Internet search
Human-powered search
Integrated search
The age of full-text search (〜1995)
「Information Retrieval」
ACM SIGIR (Special Interest Group on Information Retrieval)
Ordinary people never used search systems
Research was matured
Multimedia search (videos, music) not popular
Dawn of Internet(〜2000)
IR researchers began creating search engines
Lycos, AltaVista, Inktomi, ...
Ordinary people begay using text search systems
Search by human power(〜2010)
Google!
Use page ranks
Use the power of people
Social filtering
Social bookmarks
Search integration (~2020)
Various attributes used for search
location, time, etc.
Find information from vague keys
New search algorithms
SVM, LSI, ...
New AI-based search algorithms (~2030?)
Text search techniques
Data size and search methods
Huge
Use indexes
Small
Pattern matching
Middle
Combination of both
Long research history
Information Retrieval (IR)
ACM SIGIR (Special Interest Group on Information Retrieval)
Index database is created beforehand
Database search
Text search
Precision and recall
Precision
Correct results / All search results
Recall
Correct results / All correct results
Precision and recall
http://masui.sfc.keio.ac.jp/dbee8fd0f7e8f192bc024f4d49e9d4e9.graffle http://gyazo.com/aa28558496ec2d2ad76d23619ce71a43.png
Both value should be 100% ideally
Various tradeoffs
F value = $ 2 \over {{ 1 \over {Recall}} + {1 \over {Precision}}}
"harmonic mean" of precision and recall
Boolean search
Check if a word is included in a text
Use word-document matrix
http://gyazo.com/7e30f277ce8b72bdb57a37f9fde7c63b.png
Boolean search
Pros
AND/OR/NOT search available
e.g. Brutus AND Caesar but NOT Calpurnia
Cons
Huge matrix required
Ranking impossible
Inverted index
Compressing word-document matrix
https://gyazo.com/28862fde6cc350e2d73aa08958ae81ca
Handling words
Extract words from documents
Fairly easy in English
Difficult in Japanese
Normalize
Process variations (go, going, went ...)
Stemming
Remove "stop words" (prepositions, articles, etc.)
to be or not to be
Vector space mode
Represent a document as a word vector
Use inner product for calculating similarity
Word-document matrix
http://gyazo.com/5d8bf7d946262da0b78887494b276900.png
Vector space model
Add weights to words based on its frequency in documents
$ \bold{tf}: Term Frequency
$ \bold{idf}: Inverse Document Frequency
tf-idf
$ \bold{tfidf}_{ij} = \bold{tf}_{ij} \times \bold{idf}_iweight of $ iin document$ j
$ \bold{tf}_{ij} =$ \bold{n}_{ij} \over \sum_k{\bold{n}_{kj}}
$ \bold{n_{ij}} =frequency of word $ \bold{t}_iin document $ \bold{d_j}
this value is large if a term appears more times than other term
if $ \bold{d}_1is a b c c d e f, $ \bold{tf}_{c1}is$ 2/7、$ \bold{tf}_{a1}is$ 1/7
if $ \bold{d}_2is a b c d e f g h$ \bold{tf}_{c2}is $ 1/8、$ \bold{tf}_{a2}is $ 1/8
$ \bold{idf}_i = \log(\bold{N} / \bold{df}_i)
$ \bold{N} = total number of documents
$ \bold{df}_i =number of documents that include $ \bold{t}_i
becomes small if many documents include the term
number of documents is 10 and 2 of them include c, $ \bold{idf}_c = log(10/2) = 1.6 $ \bold{tfidf}_{c1} = 0.460 $ \bold{tfidf}_{c2} = 0.201
number of documents is 10 and 4 of them include a,$ \bold{idf}_a = log(10/4) = 0.91 $ \bold{tfidf}_{a1} = 0.131 $ \bold{tfidf}_{a2} = 0.114
As a result, if a term is frequently used in a specific document, tf-idf value becomes larger
Association search
https://gyazo.com/d58fe4fa74b11089a324873f52277a1d
Vector space search + modern interface
GETA
Word-document matrix (WAM: Word-Article Matrix) handling library
Association search library
Clustering library
https://gyazo.com/b22f90d8499368e7ee030f118f3164d6
https://gyazo.com/f4b2e26a6b1cc9ef0c069dbb90facf49
https://gyazo.com/3b96031051cbd5ffd233352961722d43
Relevance Feedback
Use the search results for next search
Can be manual or automatic
Use the best search result for next search
Use the words in the top result for next search
Search results improved
Problems of relevance feedback
Users don't send feedbacks
Longer query
Users are not interested recall ratio
Query expansion
http://gyazo.com/e696d55b51d26c7016a48c7e11ac976a.png
Use thesaurus and search log
Useful on Web keyword search
Probability model
Modeling the word sequences based on probability
Calculate the probability of a document matching a query word
Based on theory
tf-idf calculation is based on heuristics
LSI (Latent Semantic Indexing)
Use "singular value decompositions" (SVD) of word-document matrix
SVD calculation
Convert a word-document matrix to smaller-rank matrix
Smaller calculation + higher recall
Small-scale text search
Scan full text with text search algorithms
Use pattern match algorithms
Pattern matching algorithms
Perfect match algorithm
Regular expression
Approximate pattern matching
Perfect matching algorithm
Find a pattern P[1..m] in text T[1..n]
Time-consuming if characters are checked one by one from the left
Faster algorithms
Knuth-Morris-Pratt (KMP) method
Boyer-Moore method
Shifter algorithm
Simple algorithm
Compare T[1..m]and P[1..m]
T: aaaaaaaaaaaaaaaaab
P: aaaaaab
Compare T[2..m+1] and P[1..m]
T: aaaaaaaaaaaaaaaaab
P: aaaaaab
...
Compare T[n-m+1..n] and P[1..m]
$ n \times m comparesons required
Knuth-Morris-Pratt algorithm
Faster algorithm invented by Donald Knuth, et al.
P[7] (= b)で不一致が検出されたとき
前のテキストはaaaaaaであることが既知
P[6] (= a)の比較をやり直せばよい
不一致が検出されたときに比較をやりなおすべきPの文字の位置をあらかじめ「失敗関数」として計算
無駄な比較を繰り返さないようにする
http://gyazo.com/1d7b988f5207f5d70182f576cd28e525.png
Example of Knuth-Morris-Pratt algorithm
P = abcaba の場合
http://gyazo.com/981d1b79dceedfa85cf5995ea44be3b4.png
5 → 1 の遷移
5の状態でテキストの最後がabcaであることが既知
→2の状態まで戻れる
5から6に遷移できない場合
次の文字はbでない
→1の状態まで戻れる
Boyer-Moore algorithm
Fastest text comparison algorithm
Number of character comparisons can be smaller than the string length
All the characters should be checked in KMP method
Check the pattern from the right
1回目の文字比較でT[m] = c であることがわかる
cはPに含まれていないので2回目の比較はT[m+1]の先から調べればよい
Since P does not contain c, second check can be performed from T[m+1]
http://gyazo.com/c0ef58ec05df6c96c8591907363c7b2c.png
Boyer-Moore algorithm (Cont'd)
右からk文字だけ一致したときにその次にどこから比較を再開すればよいかをあらかじめパタンから計算
Use two pre-calculated tables
d1[] -- 右端で不一致が検出されたときの、移動数[不一致文字]の表
d2[] -- 右からk文字目で不一致が検出されたときの、移動数[不一致位置]の表
先の例の場合、d1['a'] = 1, d2['c'] = d1['d'] = ... = 8である。
"Shifter" algorithm
Represent the matching state by a $ mbit pattern
$ kth bit represents $ k character match
e.g. 11101 = mathing 2 characters
When the next character is $ c, left-shift this value one bit, and calculate the next state by OR with T[c]
T[c]はあらかじめ計算しておく値で、パタン文字==cの位置だけ0になっている。
T[c] is a pre-calculated pattern array
e.g. pattern abac → T['a'] = 0101
Shifter algorithm (Cont'd)
Sometimes faster than Boyer-Moore algorithm
Used in a utility agrep
Asearch
Approximate search library by Masui
https://gyazo.com/ae40ec58ecbabb3935f47f531340894f
Available on Ruby and Node
Regular expression
Powerful expression for specifying text patterns
Ambiguous matching patterns
Became popular on Unix tools
Regular expression patterns
Selection
abc or def
Character class
a or b or c
Repetition
more than 3 a's
Examples of regular expression
abc
(abc|def)
[abc]
ab*c
Examples of regular expression
(時計|時間|時刻)を([0-9]*)時に(セットする|設定する|あわせる)
ファイルを((きっちり(きっちり)?)?)消す
Generative grammer
Grammatical rules for generating sentences
Terminal symbols (e.g. "mountain") and non-terminal symbols (e.g. noun)
noun ← "mountain"
adjective ← "high"
noun phrase ← adjective noun
Regular grammar
A grammer representing regular expression
Only the form A ← aB (A/B: nonterminal, a: terminal)
Handling regular expression
Convert to state transition machine
e.g. a(b|cd) and its grammer and state transition machine
A ← aS
B ← bA
C ← cA
B ← dC
http://gyazo.com/ef47894d05a71d085fb31cdd9527910e.png
Pattern matching algorithms for regular expression
Aho-Corasick method
grep method
egrep method
Theory of regular expression algorithms
https://www.amazon.co.jp/dp/B07JHRL2NS https://gyazo.com/a7cf80318f7d193f583a6f9dbeb36562
Aho-Corasick method
Understand (string1|string2|...) pattern
Extention to KMP algorithm
Used in fgrep command
Exmple: ac|baa|bacd|ba|bb
http://gyazo.com/7926b6c03cc66c5f883d4aa2a4a14da7.png
grep algorithm
Used in grep command in Unix
Recursively call functions that take care of *、?, etc.
Cannot handle patterns like (abc|def)
egrep algorithm
Used in egrep command in UNIX
Can handle all regular expressions
Algorithm
Convert a regular expression to NDFA(Non-deterministic finite automaton)
Convert NDFA to DFA (Deterministic finite automaton)
Create a state transition table of DFA
Fast pattern matching
Example of NDFA
pattern (SUB)*SECTION
http://gyazo.com/2df5627603e51e8f04ed465eda1e9f4b.png
Convert NDFA to DFA
http://gyazo.com/d9e5779fe46637b3fedf7783e0e2d1d3.png
Generated DFA
http://gyazo.com/c74f45af19228955d3a50cb796d6781b.png
(Original NDFA)
http://gyazo.com/2df5627603e51e8f04ed465eda1e9f4b.png
What a regular expression cannot do
Cannot count numbers
No state is used
Cannot handle approximate matching
Middle-level pattern matching
Don't use large index
"Signature method"
Also know as "BitFunnel"
Express a word as a bit pattern
「express」→ 0000000100000000
「word」→ 0000000000001000
Use or for the combination
「express word」→ 0000000100001000
Use and operation between search keywords and document pattern
Information retrieval and direct manipulation
Direct manipulation
Contiinuous
Reversible
Concrete
Information retrieval based on direct manipulation
Result changes as soon as search condition changes
Go back to the previous state by undo
Not popular even today
Incremental search
Search result changes as soon as keyword changes
Emacs search
Norikae-annai (changing trains)
Google Suggest
http://gyazo.com/f0e800ae8648adac94c020bd61951a79.png
Demo: Norikae-an'nai
Migemo
Japanese incremental search on Emacs
http://gyazo.com/de4be34b79fe5fab36f765428f69c806.png
Demo: Migemo
Limitation of text search
Are you satisfied with Google search?
Good search keyword required
Difficult to search from vague information
Good restaurant
Interesting book
Google is just a keyword search system
Other services for specific needs
Search methods other than keyword-based search
Page rank
Link-based search
Zooming search
Location-based search
Approximate pattern matching
Faceted search
Page rank
Invented by Larry Page (Founder of Google)
Use link information for calculating weights
Page rank becomes large if linked from important pages
Basically based on human power
Larry Page
https://gyazo.com/5a7ea84315f9ec3569afb46f070aa3d3
https://gyazo.com/43e0c52c1f5ab35dab98ed2b2ba40d90
X=Z Xis receiving the only link fromZ
Y=X/2 Y is receiving one of the two links fromX
Z=X/2+Y Z is receiving one of the two links from X, and receiving the only link from Y
⇒ X : Y : Z = 0.4 : 0.2 : 0.4
Link-based search
What is this?
http://gyazo.com/03435fbba38fe55a3572342c2bde1118.png
I could find it through the grapevine
How I found the info
http://gyazo.com/03435fbba38fe55a3572342c2bde1118.png
Draw one pattern
⇒ Ask people if they know it
⇒ One person said it is "Hana-chiru-sato"
⇒ Found it on google
Link-based search
Put similar information together
Similar date, content, foldeer, etc.
http://gyazo.com/49ff2980552894951bb1a1e0e12368a7.png
Naboring info
http://gyazo.com/9ff211e2c2f30b87fe047ecff7bd0f49.png
Link-based search on Scrapbox
https://gyazo.com/2c70348a630a341146bc9fcba90b3111 https://scrapbox.io/UIPedia/Scott%20R.%20Klemmer
dshelf
Browsing contents only using arrow keys
Listed books are similar in some sense
https://gyazo.com/21eb8e60eb236cc47eb96118b6fe501a.png
Use information visualization technique
https://gyazo.com/2f8eb8956a9ba8c8838a9c23a47718d2
Location-based search
Powerful, but not popular yet
https://gyazo.com/65e2fe004ba378b99a265c14b681249d
https://gyazo.com/b545e286d2692b42ade2ec90b11af33f http://www.pitecan.com/tmp/RainbowZoomer/photobrowser2.html#%E3%83%AF%E3%82%A4%E3%83%B3
"Do you mean?" search
Sometimes used in Google search
Categories (facets) are provided beforehand
Users can select facets for
Simply reduce the size of target database
Popular on various EC sites
http://gyazo.com/13d047dd83955b6d5ea69bdae0390aea.png http://scope.knowledge-works.co.jp/2012/06/%E3%83%87%E3%82%A3%E3%83%AC%E3%82%AF%E3%83%88%E3%83%AA%E6%A4%9C%E7%B4%A2-vs-%E3%83%95%E3%82%A1%E3%82%BB%E3%83%83%E3%83%88%E6%A4%9C%E7%B4%A2/
https://gyazo.com/f1b22415869c2dcc430f084475f25c2a
Helpfeel
https://gyazo.com/883080ab4c36aa6200dd691e6a619e6f https://helpfeel.notainc.com/SFCHelp/?q=
Helpfeel
https://youtu.be/Nbwh7KrfIVA
https://youtu.be/QjiNaYfQ4Ak
Helpfeel taxi ad
https://vimeo.com/539549975
Other information retrieval systems
Image search
Video search
Audio search
Future of search interface
Integration of new search algorithms and interaction techniques]
Integration of search / input / registration